Observe MCP Server

OPAL Anomaly Detection Guide.md•20.7 KiB

# OPAL Anomaly Detection Guide for Microservices

This guide provides patterns and examples for implementing anomaly detection monitors in Observe using OPAL, with a focus on microservice latency monitoring.

## Table of Contents

1. [Basic Anomaly Detection Patterns](#basic-anomaly-detection-patterns)
2. [Enhanced Patterns with Timeshift](#enhanced-patterns-with-timeshift)
3. [Advanced Subquery Patterns](#advanced-subquery-patterns)
4. [Implementation Guidelines](#implementation-guidelines)
5. [Monitor Configuration Examples](#monitor-configuration-examples)
6. [Troubleshooting and Best Practices](#troubleshooting-and-best-practices)

---

## Basic Anomaly Detection Patterns

### 1. Statistical Z-Score Anomaly Detection

**Best for: Detecting statistical outliers in service latency**

```opal
align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
// The division by 1000000 yields the millisecond value stored in p95_ms
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| filter service_name = "your-service-name"
| make_event
// Rolling baseline over the previous 2 hours
| make_col running_avg: window(avg(p95_ms), frame(back:2h))
| make_col running_stddev: window(stddev(p95_ms), frame(back:2h))
| make_col z_score: abs(p95_ms - running_avg) / running_stddev
| make_col is_outlier: z_score > 2.5
| make_col alert_message: concat_strings("Latency anomaly detected for ", service_name, " - P95: ", string(round(p95_ms, 2)), "ms (", string(round(z_score, 2)), " std devs)")
| filter is_outlier = true
```

**Key Parameters:**

- `frame(back:2h)`: Historical window size
- `z_score > 2.5`: Sensitivity threshold (2.0 = more sensitive, 3.0 = less sensitive)
- Time range: Use 6h+ for sufficient historical data

**Monitor Configuration:**

- **Alert condition**: `is_outlier = true`
- **Notification**: Include `alert_message` for context

### 2. Percentile-Based Anomaly Detection

**Best for: Detecting when latency exceeds historical norms**

```opal
align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| make_event
| make_col historical_p95: window(percentile(p95_ms, 0.95), frame(back:24h))
| make_col is_anomaly: p95_ms > historical_p95 * 2
| make_col severity: if(p95_ms > historical_p95 * 3, "Critical", if(p95_ms > historical_p95 * 2, "Warning", "Normal"))
| filter is_anomaly = true
| statsby anomaly_count: count(), max_latency: max(p95_ms), avg_latency: avg(p95_ms), group_by(service_name, severity)
```

**Key Parameters:**

- `historical_p95 * 2`: Warning threshold (2x historical P95)
- `historical_p95 * 3`: Critical threshold (3x historical P95)
- `frame(back:24h)`: 24-hour baseline window

### 3. Multi-Service Latency Drift Detection

**Best for: Detecting when multiple services show latency issues simultaneously**

```opal
align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| make_event
| make_col running_avg: window(avg(p95_ms), frame(back:1h))
| make_col drift_ratio: p95_ms / running_avg
| make_col is_drift: drift_ratio > 1.8
| filter is_drift = true
| statsby affected_services: count_distinct(service_name), max_drift: max(drift_ratio), avg_drift: avg(drift_ratio), group_by()
| make_col is_widespread_issue: affected_services >= 3
| make_col alert_message: concat_strings("Infrastructure Alert: ", string(affected_services), " services showing high latency (avg ", string(round(avg_drift, 2)), "x normal)")
| filter is_widespread_issue = true
```

**Use Case:** Detect infrastructure-wide latency issues affecting multiple services

**Key Change:** Uses `count_distinct(service_name)` and a higher threshold (1.8x) to reduce noise

### 4. EWMA (Exponentially Weighted Moving Average) Detection

**Best for: Real-time anomaly detection with adaptive baselines**

```opal
align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| filter service_name = "critical-service"
| make_event
| make_col ewma_baseline: window(ewma(p95_ms, 12), frame(back:2h))
| make_col deviation_pct: ((p95_ms - ewma_baseline) / ewma_baseline) * 100
| make_col is_anomaly: abs(deviation_pct) > 50
| make_col alert_level: if(abs(deviation_pct) > 100, "Critical", if(abs(deviation_pct) > 50, "Warning", "Normal"))
| filter is_anomaly = true
```

**Key Parameters:**

- `ewma(p95_ms, 12)`: 12-period EWMA (1 hour with 5-minute alignment)
- `abs(deviation_pct) > 50`: 50% deviation threshold, in either direction

---

## Enhanced Patterns with Timeshift

### 5. Multi-Timeframe Trend Analysis

**Best for: Distinguishing between recent spikes and sustained performance degradation**

```opal
make_col time_period: "current"

@one_hour_ago <- @ {
  timeshift 1h
  make_col time_period: "1h_ago"
}

@six_hours_ago <- @ {
  timeshift 6h
  make_col time_period: "6h_ago"
}

union @one_hour_ago, @six_hours_ago
| align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| filter service_name = "your-service"
| statsby avg_p95: avg(p95_ms), group_by(service_name, time_period)
// Pivot the three time periods into one row per service
| make_col current_baseline: if(time_period = "current", avg_p95, 0)
| make_col one_hour_baseline: if(time_period = "1h_ago", avg_p95, 0)
| make_col six_hour_baseline: if(time_period = "6h_ago", avg_p95, 0)
| statsby current_latency: sum(current_baseline), one_hour_latency: sum(one_hour_baseline), six_hour_latency: sum(six_hour_baseline), group_by(service_name)
| filter current_latency > 0 and one_hour_latency > 0 and six_hour_latency > 0
| make_col short_term_change: ((current_latency - one_hour_latency) / one_hour_latency) * 100
| make_col long_term_change: ((current_latency - six_hour_latency) / six_hour_latency) * 100
| make_col trend_classification: if(short_term_change > 20 and long_term_change > 20, "Sustained Increase", if(short_term_change > 20 and long_term_change < 5, "Recent Spike", if(abs(short_term_change) < 10 and abs(long_term_change) < 10, "Stable", "Variable")))
| make_col alert_message: concat_strings("Trend Alert: ", service_name, " - ", trend_classification, " (", string(round(short_term_change, 1)), "% short-term, ", string(round(long_term_change, 1)), "% long-term)")
| filter trend_classification != "Stable"
```

**Trend Classifications:**

- **Sustained Increase**: Latency is elevated against both the 1-hour and the 6-hour baseline
- **Recent Spike**: Latency is elevated against the 1-hour baseline but close to the 6-hour baseline
- **Variable**: Inconsistent patterns needing investigation

### 6. Day-over-Day Baseline Comparison

**Best for: Detecting when current performance significantly deviates from the same time yesterday**

```opal
make_col series: "current"

@yesterday <- @ {
  timeshift 24h
  make_col series: "yesterday"
}

union @yesterday
| align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| filter service_name = "your-service"
| statsby current_sum: sum(if(series = "current", p95_ms, 0)),
    yesterday_sum: sum(if(series = "yesterday", p95_ms, 0)),
    current_count: sum(if(series = "current", 1, 0)),
    yesterday_count: sum(if(series = "yesterday", 1, 0)),
    group_by(service_name)
| filter current_count > 0 and yesterday_count > 0
// Divide by the per-series counts so rows from the other series do not dilute each average
| make_col current_p95: current_sum / current_count
| make_col yesterday_p95: yesterday_sum / yesterday_count
| make_col latency_change_pct: ((current_p95 - yesterday_p95) / yesterday_p95) * 100
| make_col is_anomaly: abs(latency_change_pct) > 30
| make_col severity: if(latency_change_pct > 75, "Critical Increase", if(latency_change_pct > 30, "Warning Increase", if(latency_change_pct < -30, "Significant Decrease", "Normal")))
| make_col alert_message: concat_strings("Day-over-Day Alert: ", service_name, " latency changed ", string(round(latency_change_pct, 1)), "% vs yesterday (", string(round(yesterday_p95, 2)), "ms → ", string(round(current_p95, 2)), "ms)")
| filter is_anomaly = true
```

**Benefits:**

- Accounts for daily traffic patterns
- Provides contextual comparison with actual values
- Effective for services with predictable daily cycles

---

## Advanced Subquery Patterns

### 7. Statistical Confidence with Subqueries

**Best for: High-confidence anomaly detection using statistical rigor**

```opal
@current_metrics <- @ {
  align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
  make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
  statsby current_p95: avg(p95_ms), current_stddev: stddev(p95_ms), current_count: count(), group_by(service_name)
  make_col time_period: "current"
}

@historical_baseline <- @ {
  timeshift 24h
  align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
  make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
  statsby baseline_p95: avg(p95_ms), baseline_stddev: stddev(p95_ms), baseline_count: count(), group_by(service_name)
  make_col time_period: "baseline"
}

<- @current_metrics {
  join service_name = @historical_baseline.service_name,
    baseline_p95: @historical_baseline.baseline_p95,
    baseline_stddev: @historical_baseline.baseline_stddev,
    baseline_count: @historical_baseline.baseline_count
  make_col z_score: (current_p95 - baseline_p95) / baseline_stddev
  make_col is_significant_anomaly: abs(z_score) > 2.0 and current_count >= 10 and baseline_count >= 10
  make_col anomaly_type: if(z_score > 2.0, "Performance Degradation", if(z_score < -2.0, "Performance Improvement", "Normal"))
  make_col confidence: if(abs(z_score) > 3.0, "Very High", if(abs(z_score) > 2.5, "High", if(abs(z_score) > 2.0, "Medium", "Low")))
  make_col abs_z_score: abs(z_score)
  make_col alert_message: concat_strings("Statistical Anomaly: ", service_name, " - ", anomaly_type, " (Z-score: ", string(round(z_score, 2)), ", Confidence: ", confidence, ")")
  filter is_significant_anomaly = true
  sort desc(abs_z_score)
}
```

**Key Features:**

- **Statistical rigor**: Uses Z-scores for confidence levels
- **Data quality checks**: Requires minimum sample sizes (>=10) in both the current and baseline windows
- **Confidence scoring**: Very High/High/Medium confidence levels
- **Sample size validation**: Ensures statistical significance

### 8. Multi-Service Correlation Detection

**Best for: Distinguishing infrastructure-wide issues from isolated service problems**

```opal
make_col series: "current"

@baseline <- @ {
  timeshift 2h
  make_col series: "baseline"
}

union @baseline
| align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
| make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
| statsby current_sum: sum(if(series = "current", p95_ms, 0)),
    baseline_sum: sum(if(series = "baseline", p95_ms, 0)),
    current_samples: sum(if(series = "current", 1, 0)),
    baseline_samples: sum(if(series = "baseline", 1, 0)),
    group_by(service_name)
| filter current_samples > 0 and baseline_samples > 0
// Divide by the per-series sample counts so rows from the other series do not dilute each average
| make_col current_avg: current_sum / current_samples
| make_col baseline_avg: baseline_sum / baseline_samples
| make_col latency_increase_pct: ((current_avg - baseline_avg) / baseline_avg) * 100
| make_col is_degraded: latency_increase_pct > 25
| statsby total_services: count(),
    degraded_services: sum(if(is_degraded, 1, 0)),
    avg_degradation: avg(if(is_degraded, latency_increase_pct, 0)),
    worst_service: any(if(is_degraded, service_name, "")),
    group_by()
| make_col degradation_ratio: degraded_services / total_services
| make_col incident_type: if(degradation_ratio >= 0.5, "Infrastructure Issue", if(degraded_services >= 3, "Multi-Service Issue", if(degraded_services >= 1, "Isolated Service Issue", "Normal")))
| make_col severity: if(degradation_ratio >= 0.7, "Critical", if(degradation_ratio >= 0.4, "Major", if(degraded_services >= 2, "Minor", "Info")))
| make_col alert_message: concat_strings("Correlation Alert: ", incident_type, " detected - ", string(degraded_services), "/", string(total_services), " services affected (", string(round(avg_degradation, 1)), "% avg increase)")
| filter incident_type != "Normal"
```

**Incident Classifications:**

- **Infrastructure Issue**: ≥50% of services affected (likely platform/network)
- **Multi-Service Issue**: ≥3 services affected (possible shared dependency)
- **Isolated Service Issue**: 1-2 services affected (application-specific)

### 9. Adaptive Threshold with Historical Windows

**Best for: Services with varying traffic patterns that need dynamic thresholds**

```opal
@short_window <- @ {
  align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
  make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
  statsby recent_avg: avg(p95_ms), recent_stddev: stddev(p95_ms), group_by(service_name)
  make_col window_type: "recent"
}

@long_window <- @ {
  timeshift 4h
  align 5m, duration_combined: tdigest_combine(m_tdigest("span_duration_5m"))
  make_col p95_ms: tdigest_quantile(duration_combined, 0.95) / 1000000
  statsby historical_avg: avg(p95_ms), historical_stddev: stddev(p95_ms), historical_p95: percentile(p95_ms, 0.95), group_by(service_name)
  make_col window_type: "historical"
}

<- @short_window {
  join service_name = @long_window.service_name,
    historical_avg: @long_window.historical_avg,
    historical_stddev: @long_window.historical_stddev,
    historical_p95: @long_window.historical_p95
  make_col adaptive_upper_threshold: historical_avg + (historical_stddev * 2.5)
  make_col adaptive_critical_threshold: historical_p95 * 1.5
  make_col is_warning_anomaly: recent_avg > adaptive_upper_threshold
  make_col is_critical_anomaly: recent_avg > adaptive_critical_threshold
  make_col severity: if(is_critical_anomaly, "Critical", if(is_warning_anomaly, "Warning", "Normal"))
  make_col deviation_from_normal: ((recent_avg - historical_avg) / historical_avg) * 100
  make_col alert_message: concat_strings("Adaptive Threshold Alert: ", service_name, " - ", severity, " (", string(round(recent_avg, 2)), "ms vs ", string(round(adaptive_upper_threshold, 2)), "ms threshold, ", string(round(deviation_from_normal, 1)), "% deviation)")
  filter severity != "Normal"
  sort desc(deviation_from_normal)
}
```

**Benefits:**

- **Dynamic thresholds**: Automatically adapts to service behavior patterns
- **Dual threshold system**: Warning (statistical) + Critical (percentile-based)
- **Reduces false positives**: Accounts for natural variation in service performance

---

## Implementation Guidelines

### Service Tier Recommendations

| Service Tier | Pattern Recommendations | Thresholds | Time Windows |
|-------------|------------------------|------------|--------------|
| **Tier 1 (Critical)** | Statistical Confidence + Adaptive Threshold | Z-score > 2.0, 30%+ change | 24h baseline |
| **Tier 2 (Important)** | Multi-Timeframe + Day-over-Day | Z-score > 2.5, 50%+ change | 8h windows |
| **Tier 3 (Standard)** | Percentile-Based + EWMA | Z-score > 3.0, 75%+ change | 4h windows |
| **Infrastructure** | Multi-Service Correlation | 3+ services, 25%+ degradation | 2h baseline |

### Query Time Windows and Performance

| Pattern Type | Minimum Time Range | Recommended Range | Performance Impact |
|-------------|-------------------|------------------|-------------------|
| **Basic Statistical** | 6h | 12h | Low |
| **Timeshift (24h)** | 48h | 72h | Medium |
| **Multi-timeframe** | 8h | 12h | Medium |
| **Subqueries** | 24h | 48h | High |
| **Correlation** | 4h | 8h | Medium |

### Alert Threshold Guidelines

#### Latency Thresholds by Service Type:

```
API Services: P95 > 200ms (Warning), P95 > 500ms (Critical)
Database Services: P95 > 50ms (Warning), P95 > 100ms (Critical)
Background Jobs: P95 > 5s (Warning), P95 > 15s (Critical)
Frontend Services: P95 > 1s (Warning), P95 > 3s (Critical)
```

#### Statistical Thresholds:

```
Z-score > 2.0: Medium confidence (~95% certainty)
Z-score > 2.5: High confidence (~99% certainty)
Z-score > 3.0: Very high confidence (~99.7% certainty)
```

These percentages are approximate and assume the values are roughly normally distributed. Latency distributions are often skewed, so treat them as a guide rather than a guarantee.

---

## Monitor Configuration Examples

### Example 1: Critical Service Monitor (Tier 1)

```yaml
Monitor Name: "Critical Service Latency Anomaly - Statistical"
Dataset: ServiceExplorer/Service Metrics
OPAL Query: [Pattern #7 - Statistical Confidence]
Alert Condition: is_significant_anomaly = true AND confidence IN ("High", "Very High")
Time Range: 48h
Refresh: 5m
Notifications:
  - Slack: #critical-alerts
  - Email: oncall-team@company.com
  - PagerDuty: P1 escalation
```

### Example 2: Multi-Service Infrastructure Monitor

```yaml
Monitor Name: "Multi-Service Correlation Detection"
Dataset: ServiceExplorer/Service Metrics
OPAL Query: [Pattern #8 - Multi-Service Correlation]
Alert Condition: incident_type IN ("Infrastructure Issue", "Multi-Service Issue") AND severity IN ("Critical", "Major")
Time Range: 8h
Refresh: 2m
Notifications:
  - Slack: #infrastructure-alerts
  - Email: platform-team@company.com
```

### Example 3: Standard Service Monitor (Tier 3)

```yaml
Monitor Name: "Standard Service - Day over Day Comparison"
Dataset: ServiceExplorer/Service Metrics
OPAL Query: [Pattern #6 - Day-over-Day Baseline]
Alert Condition: is_anomaly = true AND severity IN ("Warning Increase", "Critical Increase")
Time Range: 48h
Refresh: 15m
Notifications:
  - Slack: #service-alerts
  - Email: dev-team@company.com
```

---

## Troubleshooting and Best Practices

### Common Issues and Solutions

#### 1. **False Positives**

**Problem**: Too many alerts during normal traffic variations

**Solutions:**

- Increase Z-score thresholds (2.0 → 2.5 → 3.0)
- Use longer historical windows (2h → 4h → 24h)
- Add minimum sample size requirements
- Use adaptive thresholds for variable services

#### 2. **Missed Anomalies**

**Problem**: Real issues not being detected

**Solutions:**

- Decrease Z-score thresholds (3.0 → 2.5 → 2.0)
- Reduce historical window size for faster detection
- Use multiple detection methods (combine patterns)
- Check data quality and completeness

#### 3. **Query Performance Issues**

**Problem**: Monitors taking too long or timing out

**Solutions:**

- Use shorter time ranges where possible
- Avoid complex subqueries for high-frequency monitors
- Consider data acceleration for frequently queried datasets
- Use simpler patterns for less critical services

#### 4. **Data Quality Problems**

**Problem**: Inconsistent or missing metrics

**Solutions:**

- Add data validation checks (minimum sample sizes)
- Filter out null/invalid values
- Use `if_null()` functions for missing data handling
- Monitor data ingestion separately

### Best Practices

#### Monitor Design

1. **Start Conservative**: Begin with higher thresholds and adjust down
2. **Layer Detection**: Use multiple patterns for critical services
3. **Context in Alerts**: Always include current vs baseline values
4. **Test with Historical Data**: Validate against known incidents

#### Performance Optimization

1. **Use Appropriate Time Windows**: Don't query more data than needed
2. **Leverage Data Acceleration**: For frequently accessed datasets
3. **Track Monitor Performance**: Watch query execution times
4. **Batch Similar Monitors**: Group related alerts to reduce query load

#### Alert Management

1. **Progressive Alerting**: Warning → Critical → Emergency escalation
2. **Context-Rich Messages**: Include troubleshooting information
3. **Runbook Links**: Include investigation procedures
4. **Alert Fatigue Prevention**: Tune thresholds to maintain signal-to-noise ratio

### Performance Monitoring for Monitors

```opal
// Monitor your monitors - track execution times and success rates
statsby avg_execution_time: avg(execution_duration),
    success_rate: (sum(if(status = "success", 1, 0)) / count()) * 100,
    group_by(monitor_name)
| filter success_rate < 95 or avg_execution_time > 30000 // 30 second threshold
```

### Data Quality Checks

```opal
// Validate data completeness before anomaly detection
statsby total_services: count_distinct(service_name),
    total_samples: count(),
    null_latency_count: sum(if(is_null(p95_ms), 1, 0)),
    group_by()
| make_col data_quality_score: ((total_samples - null_latency_count) / total_samples) * 100
| filter data_quality_score < 95 // Alert if data quality drops below 95%
```

---

## Conclusion

This guide provides a foundation for implementing anomaly detection in Observe. Start with basic patterns for standard services, then move to advanced patterns for critical systems. Remember to:

1. **Match patterns to service criticality** - Use sophisticated detection for important services
2. **Balance sensitivity vs noise** - Tune thresholds to maintain alert quality
3. **Include context in alerts** - Make alerts actionable with baseline comparisons
4. **Monitor your monitors** - Ensure detection systems are performing well
5. **Iterate and improve** - Continuously refine based on operational feedback

For additional support or advanced use cases, consult the Observe documentation or reach out to the community.
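
### Appendix: Checking Z-Score Thresholds Offline

The guide's advice to test with historical data and its Z-score confidence figures can be sanity-checked outside Observe. The following is a minimal Python sketch, not OPAL and not part of any Observe API: the function names, the sample series, and the window size are hypothetical. It assumes P95 latency values have already been exported as a plain list, and it excludes the current point from its own baseline, which may differ from how an OPAL `window(..., frame(back:2h))` frame is evaluated.

```python
import math
from statistics import mean, stdev


def normal_coverage(z: float) -> float:
    """Two-sided probability that a normally distributed value
    falls within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))


def rolling_zscore_outliers(values, window, threshold):
    """Return (index, z_score) pairs where a value deviates from the
    mean of the preceding `window` values by more than `threshold`
    sample standard deviations."""
    hits = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # trailing baseline, current point excluded
        sd = stdev(history)
        if sd == 0:
            continue  # flat baseline: the z-score is undefined, so skip it
        z = abs(values[i] - mean(history)) / sd
        if z > threshold:
            hits.append((i, round(z, 2)))
    return hits


# Hypothetical exported P95 latencies in milliseconds, one per 5-minute bucket.
p95_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 100]

for z in (2.0, 2.5, 3.0):
    print(f"z > {z}: about {normal_coverage(z) * 100:.1f}% of normal samples fall inside")

print(rolling_zscore_outliers(p95_ms, window=6, threshold=2.5))
```

Replaying a known incident through a function like this with different `window` and `threshold` values gives a rough feel for how a monitor would have behaved before changing the live query. Note that once a spike enters the trailing window it inflates the standard deviation, so points shortly after an incident are less likely to be flagged.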
